Abstract: Bioinformatics, proteomics and genomics are central
to understanding the human genome and its diseases. Several
hereditary genetic diseases, such as retinoblastoma, involve
a sequence of complex interactions between multiple
biological processes. In this paper, genetic similarities are
identified within the DNA sequences of a selected group of
patients using signal processing tools. DNA, RNA and protein
sequences show similarities in gene structure and function
that are related to gene location. We introduce a novel
method for integrative gene prediction that uses a scoring
matrix and wavelet windowing. The proposed method not only
integrates multiple genomic data but can also be used to
predict gene location, gene mutation and genetic disorder
from multi-block genomic data. Its performance was assessed
by simulation.

I. INTRODUCTION

Retinoblastoma is a malignant cancer of the developing
retinal cells, caused in the majority of cases by mutations
in both copies of the RB1 gene. The RB1 gene is a tumor
suppressor gene located on chromosome 13q14 and was the
first tumor suppressor gene to be cloned. The gene codes for
the tumor suppressor protein pRB, which, by binding to the
transcription factor E2F, prevents the cell from entering the
S phase of the cell cycle. Recent findings suggest that
post-mitotic cone precursors are uniquely sensitive to pRB
depletion and may be the cells in which retinoblastoma
originates. The occurrence and viability of retinoblastoma
cells may be more complex than suggested by simple loss of
function of the RB1 alleles. Studies of hereditary
retinoblastoma demonstrate close linkage between the gene for
this cancer and the genetic locus for esterase D. The data
presented here support the hypothesis that at least one
cancer, the retinoblastoma observed in children, is caused by
two mutational events.



Fig. 1: Healthy eye



Fig. 2: Retinoblastoma affected eye 



Fig. 3: Flow graph of gene mutation in Eye 

In 5% of retinoblastoma cases with germline mutations, the
family history is positive. The risk of developing bilateral
and multifocal retinoblastoma is high, and the age of onset
is around 15 months. The mean number of tumors is about 5
across the two eyes. The offspring of a parent with bilateral
retinoblastoma have a 50% probability of developing a tumor
and a 50% probability of inheriting the germline mutant
allele. Reduced penetrance of 10 to 15% lowers the estimated
occurrence of disease from 50% to 25%. Individuals in whom
both alleles are mutated somatically carry no mutation in
their germ cells and therefore usually pass no tumor risk on
to their offspring.

II. DNA SEQUENCE 

Deoxyribonucleic acid (DNA) and ribonucleic acid (RNA) are
built from nucleotides, each consisting of a nucleobase, a
pentose sugar and a phosphate group. The DNA nucleobases are
Cytosine (C), Guanine (G), Adenine (A) and Thymine (T); the
RNA nucleobases are Cytosine (C), Guanine (G), Adenine (A)
and Uracil (U) [1]. In recent years, huge databases of
genetic information have become available as open resources,
which has led to great progress in bioinformatics. If the
genetic sequences are known, this information can be very
important for early disease diagnosis and drug discovery [2].
This has led to biological sequence alignment, a field of
bioinformatics and computational biology. It aims to analyze
similarities between DNA, RNA or protein sequences in order
to predict the genetic relationship between organisms as
well as structural or functional relationships.

A segment of DNA that carries the information for a
functional product is called a gene. Genes control protein
synthesis and regulate most of the activities inside a living
organism. All the genetic information is copied when a cell
divides. A change in the base sequence of a DNA strand is
called a mutation. Mutations can lead to disease or to the
death of a cell.




The numerical representation of DNA sequences is essential
because almost all DSP techniques involve two steps: mapping
the symbolic sequence to a numeric one, and computing some
transform of the resulting numeric series [2]. Most numerical
representations associate one numerical value with each
position in the sequence, using values related to each
nucleotide, and thereby indicate the presence or absence of a
certain nucleotide at a specific position [3]. Another
approach is to include information about the number and type
of repeated nucleotides, generating a single numerical value
for each DNA subsequence that may be associated with a
repeat. This representation needs a mapping algorithm that
uses distances to find similar subsequences and then
evaluates a consensus sequence for these subsequences to
generate candidates.
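
As a minimal sketch of the position-wise mapping described in this section, the following Python function builds one binary indicator sequence per nucleotide. The function name `indicator_sequences` and the example string are illustrative assumptions, not part of the original method.

```python
def indicator_sequences(seq, alphabet="ACGT"):
    """Map a DNA string to one binary indicator sequence per nucleotide.

    The entry at position n is 1 if that nucleotide occurs at n, else 0.
    """
    seq = seq.upper()
    return {base: [1 if s == base else 0 for s in seq] for base in alphabet}


# Hypothetical example sequence, for illustration only.
u = indicator_sequences("ATGCA")
print(u["A"])  # [1, 0, 0, 0, 1]
```

At every position exactly one of the four indicator sequences is 1, so the symbolic sequence can be recovered from them.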

Fig. 5: DNA Helical Structure


III. GENE PREDICTION 

Gene prediction refers to detecting the locations of the
protein-coding regions of genes in a long DNA sequence.
Signal processing techniques offer great promise for
analyzing genomic data because of its digital nature.
However, signal processing analysis of bio-molecular
sequences is hindered by their representation as strings of
alphabet characters.


Table 1: Genetic code

IV. NUMERICAL REPRESENTATION

The numerical representation of a DNA sequence is given as a
sequence of integers derived from a unique graphical
representation of the standard genetic code. This numerical
representation is suitable for the quantitative analysis of
the sequences.

4.1 LD matrix 

The LD matrix holds pairwise linkage disequilibrium (LD)
values. The LD measure can be chosen as "composite" for the
LD composite measure, "r" for the R coefficient (estimated by
the EM algorithm), "dprime" for D', or "corr" for the
correlation coefficient. The method "corr" is equivalent to
"composite" when SNP genotypes are coded as 0 for BB, 1 for
AB and 2 for AA. Matrix elements adjacent to the main
diagonal represent the lengths of the line segments forming
the line.
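
To make the "corr" measure concrete, here is a small sketch that computes a pairwise correlation matrix from genotypes coded 0, 1, 2. It is an illustration under stated assumptions rather than the implementation used in this work: the helper names `pearson` and `ld_corr_matrix` and the genotype values are hypothetical, and every SNP is assumed to vary across samples (a constant row would give a zero denominator).

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / math.sqrt(vx * vy)


def ld_corr_matrix(genotypes):
    """Pairwise correlation matrix; each row holds one SNP coded 0/1/2."""
    return [[pearson(a, b) for b in genotypes] for a in genotypes]


# Hypothetical genotypes for 3 SNPs in 6 samples (0 = BB, 1 = AB, 2 = AA).
snps = [
    [0, 1, 2, 1, 0, 2],
    [0, 1, 2, 1, 0, 2],
    [2, 1, 0, 1, 2, 0],
]
ld = ld_corr_matrix(snps)
```

In this made-up example the first two SNPs are identical, giving a correlation of 1, while the third is their mirror image, giving -1.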

4.2 Transition matrix 

A transition matrix describes the transitions from one kind
of base to another. For a given DNA sequence s, we can
construct a 4x4 matrix A = (tij), where tij is the number of
times base i is immediately followed by base j in the
sequence. A is called the transition frequency matrix of s.
We can then construct a matrix P = (Pij) by dividing each
element of A by the total of all entries in A. This matrix
represents the relative frequency of every possible type of
transition and is called the transition proportion matrix of
s. The initial mapping of DNA to binary represents the
sequence by four binary indicator sequences, one per
nucleotide, showing the presence '1' or absence '0' of the
relevant nucleotide at each location n.
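
The construction of A and P can be sketched in a few lines of Python. The function name `transition_matrices`, the row and column order A, C, G, T, and the example sequence are assumptions made for illustration; the sketch also assumes the sequence has at least two bases and contains only these four letters.

```python
BASES = "ACGT"  # assumed row/column order of the 4x4 matrices


def transition_matrices(seq):
    """Return (A, P): transition counts and transition proportions.

    A[i][j] counts how often base i is immediately followed by base j;
    P divides every entry of A by the total number of transitions.
    """
    index = {b: k for k, b in enumerate(BASES)}
    A = [[0] * 4 for _ in range(4)]
    for cur, nxt in zip(seq, seq[1:]):
        A[index[cur]][index[nxt]] += 1
    total = sum(map(sum, A))
    P = [[a / total for a in row] for row in A]
    return A, P


# Hypothetical sequence with the transitions AA, AC, CG, GT, TA.
A, P = transition_matrices("AACGTA")
```

For this six-base example there are five transitions, so each observed transition has a proportion of 0.2 and the entries of P sum to 1.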

4.3 Complex representation 

The complex representation is based on the assumption that
the components of the four 3-D tetrahedral vectors
representing the DNA letters are either +1 or -1. The
dimensionality of the resulting bipolar representation can
then be reduced to two, which yields a complex-valued
sequence.
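
A two-dimensional bipolar representation can be written as a sequence of complex numbers whose real and imaginary parts are +1 or -1. The sketch below uses one commonly quoted assignment of values to bases; this assignment is an assumption, since the text does not state which base receives which value, and `complex_representation` is a hypothetical helper name.

```python
# Assumed assignment of bipolar (+1/-1) complex values to the four bases;
# the text does not specify which base receives which value.
COMPLEX_MAP = {"A": 1 + 1j, "C": -1 - 1j, "G": -1 + 1j, "T": 1 - 1j}


def complex_representation(seq):
    """Map a DNA string to a list of complex numbers, one per base."""
    return [COMPLEX_MAP[b] for b in seq.upper()]


x = complex_representation("ACGT")
```

With this particular assignment, complementary bases (A and T, C and G) map to complex conjugates of each other.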

V. DSP TECHNIQUES 

A. DFT 

The Fourier transform is used to detect likely coding regions
in DNA sequences by computing the amplitude profile of the
spectral component at frequency f = 1/3, which appears as a
sharp peak in the power spectrum. The strength of the peak
depends on how pronounced the three-base repetition is in the
gene. This approach gives relatively good results, but it
depends on the DNA sequence, so the mapping scheme must be
computed before gene prediction. Alternatively, the DNA
sequence can be modeled as the output of an all-pole system
driven by a white random process, so that autoregressive (AR)
modeling replaces Fourier analysis for exon prediction.
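
The peak at f = 1/3 can be illustrated with a short sketch that sums the squared DFT magnitudes of the four binary indicator sequences. This is a simplified illustration, not the authors' code: the function name `power_spectrum` and the test sequence are assumptions, and a direct O(N^2) DFT is used only to avoid external dependencies.

```python
import cmath


def power_spectrum(seq, bases="ACGT"):
    """Total power spectrum S[k] = sum over bases of |U_b[k]|^2.

    U_b is the DFT of the binary indicator sequence of base b.
    A direct O(N^2) DFT is used to keep the sketch dependency-free.
    """
    n = len(seq)
    spectrum = []
    for k in range(n):
        total = 0.0
        for b in bases:
            coeff = sum(
                cmath.exp(-2j * cmath.pi * k * pos / n)
                for pos, s in enumerate(seq)
                if s == b
            )
            total += abs(coeff) ** 2
        spectrum.append(total)
    return spectrum


# Hypothetical sequence with perfect three-base periodicity.
seq = "ATG" * 30
S = power_spectrum(seq)
k3 = len(seq) // 3  # DFT bin corresponding to frequency f = 1/3
```

For this artificial sequence all non-DC power in the lower half of the spectrum falls into bin `k3`; real coding regions show a weaker but still visible peak there.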

B. STFT 

For non-stationary signals, the Short Time Fourier Transform
(STFT) is frequently used for DFT-based spectral analysis. In
the STFT, the time signal is divided into short segments and
a DFT is calculated for each of these segments. A
three-dimensional graph called a spectrogram is obtained by
plotting the squared magnitude of the DFT coefficients as a
function of time and frequency.
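
The segment-by-segment computation can be sketched as follows. The sketch assumes a rectangular window and the binary indicator mapping of the four bases; the function name `spectrogram`, the window and step sizes, and the test sequence are illustrative choices, not values from this work.

```python
import cmath


def spectrogram(seq, window, step, bases="ACGT"):
    """Squared DFT magnitude of each window, summed over the four bases.

    Returns a list of rows, one per window position; row[k] is the power
    at frequency k / window. A rectangular window is assumed.
    """
    rows = []
    for start in range(0, len(seq) - window + 1, step):
        segment = seq[start:start + window]
        row = []
        for k in range(window):
            power = 0.0
            for b in bases:
                coeff = sum(
                    cmath.exp(-2j * cmath.pi * k * n / window)
                    for n, s in enumerate(segment)
                    if s == b
                )
                power += abs(coeff) ** 2
            row.append(power)
        rows.append(row)
    return rows


# Hypothetical sequence: a period-3 stretch flanked by uniform stretches.
seq = "A" * 60 + "ATG" * 20 + "A" * 60
spec = spectrogram(seq, window=30, step=30)
period3_track = [row[30 // 3] for row in spec]
```

Reading the bin at one third of the window length across all rows gives a position-dependent period-3 profile, which is high only for the windows that lie inside the periodic stretch.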

C. DWT 

The Discrete Wavelet Transform (DWT) is a mathematical tool
that can be used very effectively for non-stationary signal
analysis. The DWT can be computed very efficiently with an
algorithm called the Fast Wavelet Transform (FWT). Methods
based on a modified Gabor-wavelet transform (MGWT) for the
identification of protein coding regions also exist.
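
The efficiency of the FWT comes from repeatedly splitting only the approximation coefficients, which halves the data at every level. The sketch below shows this structure for the Haar wavelet, chosen here purely for simplicity; the text does not say which wavelet is used, and the names `haar_step` and `haar_fwt` and the input signal are hypothetical.

```python
import math


def haar_step(x):
    """One level of the Haar DWT: (approximation, detail) coefficients."""
    r = math.sqrt(2.0)
    approx = [(x[i] + x[i + 1]) / r for i in range(0, len(x) - 1, 2)]
    detail = [(x[i] - x[i + 1]) / r for i in range(0, len(x) - 1, 2)]
    return approx, detail


def haar_fwt(x, levels):
    """Multi-level Haar transform by repeatedly splitting the approximation.

    Assumes len(x) is divisible by 2 ** levels.
    Returns [approx_L, detail_L, ..., detail_1].
    """
    details = []
    approx = list(x)
    for _ in range(levels):
        approx, d = haar_step(approx)
        details.append(d)
    return [approx] + details[::-1]


# Hypothetical numeric signal, e.g. an indicator sequence of one base.
signal = [1, 0, 0, 1, 1, 1, 0, 0]
coeffs = haar_fwt(signal, levels=3)
```

Because the Haar transform is orthonormal, the sum of squared coefficients equals the sum of squared signal values, which is a quick sanity check on any implementation.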

VI. WAVELET WINDOW METHOD 

A Wavelet Transform Modulus Maximum (WTMM) is defined as a
point $(x_0, t_0)$ such that

$$|W(x_0, t)| < |W(x_0, t_0)|$$

when t belongs to either the right or the left neighbourhood
of $t_0$, and

$$|W(x_0, t)| \le |W(x_0, t_0)|$$

when t belongs to the other side of the neighbourhood of
$t_0$. We define a maxima line as any connected curve in the
scale space (x, t) along which all points are WTMM.
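
The WTMM condition can be checked directly on sampled coefficients. The sketch below assumes that the coefficients W(x0, t) for one fixed x0 are available as a list indexed by t, and that the inequality is strict on one side and non-strict on the other; `modulus_maxima` is a hypothetical helper name and the sample values are made up.

```python
def modulus_maxima(w):
    """Indices t0 where |w[t0]| is a modulus maximum.

    The modulus must be strictly larger than one neighbour and at least
    as large as the other neighbour. End points are not considered.
    """
    m = [abs(v) for v in w]
    maxima = []
    for i in range(1, len(m) - 1):
        strict_left = m[i - 1] < m[i]
        strict_right = m[i + 1] < m[i]
        weak_left = m[i - 1] <= m[i]
        weak_right = m[i + 1] <= m[i]
        if (strict_left and weak_right) or (strict_right and weak_left):
            maxima.append(i)
    return maxima


# Made-up coefficients: one sharp peak and one two-sample plateau.
w = [0, 1, 3, 1, 0, -2, -2, 0]
peaks = modulus_maxima(w)
print(peaks)  # [2, 5, 6]
```

Both samples of the plateau qualify because each is strictly larger than its outer neighbour; linking such points across neighbouring values of x would trace out the maxima lines.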

VII. RESULTS AND DISCUSSIONS 



Fig. 1: Best of DNA seq

Fig. 2: Fourier spectrum of DNA sequence


Fig. 3: Power spectrum SNR of DNA sequence using WWM

Fig. 4: Spectrum analysis of normal and abnormal DNA sequence



Fig. 5: PDF of given sequence 


VIII. CONCLUSION 

In this work, a new method based on an analyzing wavelet
window and a scoring matrix has been proposed for the
prediction of protein coding regions. The wavelet window
method can be applied to predict coding regions of different
lengths. Selecting the window length has always been a
problem in DSP-based methods because it affects the gene
prediction. Future work can focus on integrating this
technique to refine the predicted locations of genes and
protein coding regions.